deep compression
Deep Compression of Pre-trained Transformer Models
Pre-trained transformer models have achieved remarkable success in natural language processing (NLP) and have recently become competitive alternatives to Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) in vision and speech tasks, respectively. Due to their excellent computational efficiency and scalability, transformer models can be trained on exceedingly large amounts of data; however, model sizes can grow tremendously. As high-performance, large-scale pre-trained transformer models become available for users to download and fine-tune for customized downstream tasks, deploying these models becomes challenging due to the vast number of operations and the large memory footprint. To address this challenge, we introduce methods to deeply compress pre-trained transformer models across three major application domains: NLP, speech, and vision. Specifically, we quantize transformer backbones down to 4-bit and further achieve 50% fine-grained structural sparsity on pre-trained BERT, Wav2vec2.0, and Vision Transformer (ViT) models.
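The two operations named at the end of the abstract are easy to picture on a single weight matrix. Below is a minimal PyTorch sketch, assuming 2:4 groups for the 50% fine-grained structured sparsity and symmetric per-row scaling for the 4-bit quantization; the function names and settings are illustrative, not the paper's implementation.

```python
import torch

def two_four_sparsify(w):
    """Keep the 2 largest-magnitude weights in every group of 4
    (50% fine-grained structured sparsity)."""
    rows, cols = w.shape                       # cols assumed divisible by 4
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep, 1.0)
    return (groups * mask).reshape(rows, cols)

def fake_quantize_4bit(w):
    """Symmetric 4-bit quantization per output row: round to 16 levels, then dequantize."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

w = torch.randn(8, 16)
w_compressed = fake_quantize_4bit(two_four_sparsify(w))
```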
Deep Compression
In their current form, deep neural networks require enormous memory to store their massive over-parameterization. Classic networks such as AlexNet and VGG-16 weigh in at around 240 MB and 552 MB, respectively. Many efforts have been made to reduce the file size of neural networks, generally relying on techniques such as weight pruning, quantization, or SVD decomposition of weight matrices. This paper, Deep Compression, combines pruning, quantization, and Huffman encoding into a three-stage pipeline that reduces the size of AlexNet by 35x and VGG-16 by 49x, shrinking AlexNet from 240 MB to 6.9 MB and VGG-16 from 552 MB to 11.3 MB.
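As a rough illustration of how the first two stages fit together, here is a minimal NumPy sketch of magnitude pruning followed by k-means weight sharing; Huffman-coding the resulting cluster indices and sparsity pattern would form the third stage. The 90% sparsity and 5-bit codebook are illustrative settings, not the paper's per-layer configuration.

```python
import numpy as np

def prune(w, sparsity=0.9):
    """Stage 1: magnitude pruning -- zero out the smallest weights."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def share_weights(w, bits=5, iters=20):
    """Stage 2: cluster the surviving weights into 2**bits shared values (1-D k-means)."""
    nz = w[w != 0]
    centroids = np.linspace(nz.min(), nz.max(), 2 ** bits)  # linear initialization
    for _ in range(iters):
        idx = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(len(centroids)):
            if np.any(idx == c):
                centroids[c] = nz[idx == c].mean()
    quantized = w.copy()
    quantized[w != 0] = centroids[idx]
    return quantized, idx  # Stage 3 would Huffman-code idx and the sparse positions

w = np.random.randn(512, 512)
w_shared, codes = share_weights(prune(w))
```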
HAKD: Hardware Aware Knowledge Distillation
Turner, Jack, Crowley, Elliot J., Radu, Valentin, Cano, José, Storkey, Amos, O'Boyle, Michael
Despite recent developments, deploying deep neural networks on resource-constrained general-purpose hardware remains a significant challenge. There has been much work in developing methods for reshaping neural networks, usually with a focus on minimising total parameter count. These methods are typically developed in a hardware-agnostic manner and do not exploit hardware behaviour. In this paper we propose a new approach, Hardware Aware Knowledge Distillation (HAKD), which uses empirical observations of hardware behaviour to design efficient student networks that are then trained with knowledge distillation. This allows the trade-off between accuracy and performance to be managed explicitly. We have applied this approach across three platforms and evaluated it on two networks, MobileNet and DenseNet, on CIFAR-10. We show that HAKD outperforms Deep Compression and Fisher pruning in terms of size, accuracy and performance.
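The abstract does not spell out the training objective, but a student shaped this way would typically be trained with the standard knowledge-distillation loss. A minimal PyTorch sketch (the temperature and mixing weight are illustrative, not values from the paper):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend hard-label cross-entropy with a KL term that matches the
    teacher's temperature-softened output distribution."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                        # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```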
Weightless: Lossy Weight Encoding For Deep Neural Network Compression
Reagen, Brandon, Gupta, Udit, Adolf, Robert, Mitzenmacher, Michael M., Rush, Alexander M., Wei, Gu-Yeon, Brooks, David
The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. By leveraging the ability of neural networks to tolerate these imperfections and re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.
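The Bloomier filter is the less familiar ingredient here. As a rough illustration, below is a minimal sketch of the closely related "peel and XOR" static-function construction (MWHC): a key-to-value map stored in a small table whose cells XOR together to recover each stored value, while lookups of keys that were never stored return arbitrary values. That controlled imprecision is the kind of error Weightless tolerates and retrains around. This is a toy illustration, not the paper's encoder.

```python
import hashlib
import random

def _cells(key, seed, k, m):
    """k table positions for a key, derived from a seeded hash."""
    h = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=4 * k).digest()
    return {int.from_bytes(h[4 * i:4 * i + 4], "little") % m for i in range(k)}

def build(items, m, k=3, max_tries=50):
    """items: dict mapping keys to small ints. Returns (seed, table) so that
    lookup(key, seed, table) == items[key] for every stored key."""
    for _ in range(max_tries):
        seed = random.randrange(1 << 32)
        cells = {key: _cells(key, seed, k, m) for key in items}
        touching = [set() for _ in range(m)]
        for key, cs in cells.items():
            for c in cs:
                touching[c].add(key)
        removed, order = set(), []
        queue = [c for c in range(m) if len(touching[c]) == 1]
        while queue:                         # peel keys that own a cell alone
            c = queue.pop()
            live = touching[c] - removed
            if len(live) != 1:               # stale queue entry
                continue
            (key,) = live
            removed.add(key)
            order.append((key, c))           # cell c is "owned" by key
            for other in cells[key] - {c}:
                if len(touching[other] - removed) == 1:
                    queue.append(other)
        if len(removed) == len(items):
            break                            # every key was peeled
    else:
        raise RuntimeError("peeling failed; use a larger table")
    table = [0] * m
    for key, c in reversed(order):           # assign in reverse peel order
        rest = 0
        for other in cells[key] - {c}:
            rest ^= table[other]
        table[c] = items[key] ^ rest
    return seed, table

def lookup(key, seed, table, k=3):
    """XOR of the key's cells; arbitrary for keys that were never stored."""
    out = 0
    for c in _cells(key, seed, k, len(table)):
        out ^= table[c]
    return out

# A table ~1.25x the number of keys is usually enough for k = 3.
weights = {i: v for i, v in enumerate([3, 0, 7, 12, 5])}
seed, table = build(weights, m=8)
assert all(lookup(i, seed, table) == v for i, v in weights.items())
```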
Deep Learning Reading Group: SqueezeNet
The next paper from our reading group is by Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally and Kurt Keutzer. This paper introduces a small CNN architecture called "SqueezeNet" that achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. As you may have noticed from one of our recent posts, we're really interested in learning more about the compression of neural network architectures, and this paper really stood out. It's no secret that much of deep learning is tied up in the hell that is parameter tuning. This paper makes a case for increased study of convolutional neural network design in order to drastically reduce the number of parameters you have to deal with.
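Much of that parameter reduction comes from SqueezeNet's Fire module: a 1x1 "squeeze" convolution shrinks the channel count before feeding parallel 1x1 and 3x3 "expand" convolutions whose outputs are concatenated. A minimal PyTorch sketch (the channel counts are per-layer hyperparameters, not fixed values from the paper):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: 1x1 squeeze -> parallel 1x1 and 3x3 expand -> concat."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat(
            [self.relu(self.expand1x1(x)), self.relu(self.expand3x3(x))], dim=1
        )
```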
Compressing and regularizing deep neural networks
Deep neural networks have evolved to be the state-of-the-art technique for machine learning tasks ranging from computer vision and speech recognition to natural language processing. However, deep learning algorithms are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, deep compression significantly reduces the computation and storage required by neural networks. For example, it can reduce the model size of a convolutional neural network with fully connected layers, such as AlexNet or VGGNet, by 35x-49x. Even for fully convolutional networks such as GoogLeNet and SqueezeNet, deep compression can still reduce the model size by 10x.
QuickNet: Maximizing Efficiency and Efficacy in Deep Architectures
We present QuickNet, a fast and accurate network architecture that is both faster and significantly more accurate than other fast deep architectures like SqueezeNet. Furthermore, it uses fewer parameters than previous networks, making it more memory efficient. We do this by making two major modifications to the reference Darknet model (Redmon et al., 2015): 1) the use of depthwise separable convolutions and 2) the use of parametric rectified linear units. We observe that parametric rectified linear units are computationally equivalent to leaky rectified linear units at test time, and that separable convolutions can be interpreted as a compressed Inception network (Chollet, 2016). Using these observations, we derive a network architecture, which we call QuickNet, that is both faster and more accurate than previous models. Our architecture provides at least four major advantages: (1) a smaller model size, which is more tenable on memory-constrained systems; (2) a significantly faster network, which is more tenable on computationally constrained systems; (3) a high accuracy of 95.7 percent on the CIFAR-10 dataset, which outperforms all but one published result (an orthogonal approach that can be combined with ours); and (4) orthogonality to previous model compression approaches, allowing further speed gains to be realized.
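The abstract names the two ingredients but not the exact layer widths or depths. A minimal PyTorch sketch of one such building block, assuming a 3x3 depthwise convolution followed by a 1x1 pointwise convolution and PReLU (the block name and defaults are illustrative):

```python
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depthwise separable convolution followed by PReLU, the two ingredients
    the QuickNet abstract highlights."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.PReLU(out_ch)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```

At inference time a trained PReLU is simply a leaky ReLU with a fixed, learned slope, which is the test-time equivalence the abstract points out.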